22 research outputs found

    Qd-tree: Learning Data Layouts for Big Data Analytics

    Corporations today collect data at an unprecedented and accelerating scale, making the need to run queries on large datasets increasingly important. Technologies such as columnar block-based data organization and compression have become standard practice in most commercial database systems. However, the problem of best assigning records to data blocks on storage is still open. For example, today's systems usually partition data by arrival time into row groups, or range/hash partition the data based on selected fields. For a given workload, however, such techniques are unable to optimize for the important metric of the number of blocks accessed by a query. This metric directly relates to the I/O cost, and therefore performance, of most analytical queries. Further, they are unable to exploit additional available storage to drive this metric down further. In this paper, we propose a new framework called a query-data routing tree, or qd-tree, to address this problem, and propose two algorithms for their construction based on greedy and deep reinforcement learning techniques. Experiments over benchmark and real workloads show that a qd-tree can provide physical speedups of more than an order of magnitude compared to current blocking schemes, and can reach within 2X of the lower bound for data skipping based on selectivity, while providing complete semantic descriptions of created blocks. Comment: ACM SIGMOD 202
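
    The "number of blocks accessed by a query" metric mentioned in this abstract can be illustrated with a small sketch. This is not the qd-tree algorithm from the paper; it is plain min/max block skipping over a single integer column, and the names `Block`, `arrival_partition`, `range_partition`, and `blocks_accessed` are hypothetical, chosen only for this illustration. It compares an arrival-order layout with a range-partitioned layout for one range query.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    lo: int          # smallest value stored in the block
    hi: int          # largest value stored in the block
    rows: List[int]  # the values themselves


def arrival_partition(values: List[int], block_size: int) -> List[Block]:
    """Cut the values into blocks in the order they arrived."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(min(chunk), max(chunk), chunk))
    return blocks


def range_partition(values: List[int], block_size: int) -> List[Block]:
    """Sort first, so that each block covers a narrow value range."""
    return arrival_partition(sorted(values), block_size)


def blocks_accessed(blocks: List[Block], q_lo: int, q_hi: int) -> int:
    """Count blocks whose [lo, hi] interval overlaps the query range.

    Blocks that do not overlap can be skipped without being read.
    """
    return sum(1 for b in blocks if b.hi >= q_lo and b.lo <= q_hi)


# Assumed toy data: a shuffled permutation of 0..99 (37 is coprime with 100).
values = [(i * 37) % 100 for i in range(100)]

by_arrival = arrival_partition(values, block_size=10)
by_range = range_partition(values, block_size=10)

# Query: 20 <= value <= 29
range_count = blocks_accessed(by_range, 20, 29)      # 1 block
arrival_count = blocks_accessed(by_arrival, 20, 29)  # more than 1 block
```

    On this toy data the range-partitioned layout answers the query from a single block, while the arrival-order layout has to read several, which is the gap a workload-aware layout tries to close.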

    Physical Database Design Decision Algorithms and Concurrent Reorganization for Parallel Database Systems

    Stringent performance requirements in DB applications have led to the use of parallelism for database processing. To allow the database system to take advantage of the performance of parallel shared-nothing systems, the physical DB design must be appropriate for the DB structure and the workload. We develop decision algorithms that will select a good physical DB design both when the DB is first loaded into the system (static decision) and while the DB is being used by the workload (dynamic decision). Our decision algorithms take the database structure, workload, and system characteristics as inputs. The static (or initial) physical DB design decision algorithm involves:
    • selecting a partitioning attribute for each relation that determines how the relation is fragmented across the nodes (allowing for high I/O bandwidth);
    • selecting indexes on the relation attributes to allow faster accesses compared to sequential file scans;
    • selecting the attributes by which to cluster a relation in order to take advantage of the prefetching and caching involved in I/O access;
    • grouping of relations to allow DB operations (joins) on relation pairs to be executed locall

    Data Reorganization in Parallel Database Systems

    Parallel database systems are suitable for use in applications with high capacity and high performance and availability requirements. The trend in such systems is to provide efficient on-line capability for performing various system administration functions such as index creation and maintenance, backup/restore, reorganization, and gathering of statistics. For some of these functions, the on-line capability can be efficiently supported by the use of "incremental algorithms", i.e., algorithms that achieve the function in several relatively small (i.e., less time-consuming) steps, rather than in a single large step. Incremental algorithms ensure that only small parts of the database become inaccessible for short durations, as opposed to nonincremental algorithms, which may lock large portions of the database or the entire database for a longer duration. In this paper, we discuss issues in providing concurrent data reorganization capability using incremental algorithms in parallel databa..
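
    The idea of an "incremental algorithm" described in this abstract can be sketched as follows. This is only an illustration under simplifying assumptions, not the method of the paper: records are plain integers, the new layout is a dictionary of partitions, and `reorganize_in_steps` and `target_of` are hypothetical names. Each step touches one small batch, so only that batch would need to be unavailable at a time.

```python
from typing import Callable, Dict, Iterator, List


def reorganize_in_steps(
    records: List[int],
    target_of: Callable[[int], int],
    step_size: int,
) -> Iterator[Dict[int, List[int]]]:
    """Move records into a new partitioning, one small batch per step.

    After each batch the generator yields the (shared, growing) new layout,
    which is the point where other work could be interleaved.
    """
    new_layout: Dict[int, List[int]] = {}
    for start in range(0, len(records), step_size):
        batch = records[start:start + step_size]  # only this slice is in flux
        for record in batch:
            new_layout.setdefault(target_of(record), []).append(record)
        yield new_layout


# Assumed toy input: ten records, repartitioned by parity, three per step.
steps = 0
layout: Dict[int, List[int]] = {}
for layout in reorganize_in_steps(list(range(10)), lambda r: r % 2, step_size=3):
    steps += 1  # a real system would serve queries between steps
```

    A nonincremental version would do the whole loop as one step, which corresponds to holding the entire data set unavailable for the full duration.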